
    Deep-Learning Inferencing with High-Performance Hardware Accelerators

    FPGAs are commonly employed to accelerate applications in order to improve performance-per-watt over general-purpose architectures. With the exponential growth of available data, machine-learning applications have generated greater interest as a means of better understanding that data and increasing autonomous processing. As FPGAs become more readily available through cloud services such as the Amazon Web Services (AWS) F1 platform, it is worth studying how well FPGAs accelerate machine-learning applications compared with traditional fixed-logic devices such as CPUs and GPUs. FPGA frameworks for accelerating convolutional neural networks, which underlie many machine-learning applications, have begun to emerge for accelerated-application development. This thesis compares the performance of these emerging frameworks on two commonly used convolutional neural networks, GoogLeNet and AlexNet. Specifically, handwritten Chinese character recognition is benchmarked across multiple currently available FPGA frameworks on Xilinx and Intel FPGAs and compared against multiple CPU and GPU architectures featured on AWS, Google's Cloud Platform, the University of Pittsburgh's Center for Research Computing (CRC), and Intel's vLab Academic Cluster. The NVIDIA GPUs delivered the best performance of any device in this study. The Zebra framework available for Xilinx FPGAs showed, on average, 8.3× better performance and 9.3× better efficiency than the OpenVINO framework available for Intel FPGAs. Although the Zebra framework on the Xilinx VU9P showed better efficiency than the Pascal-based GPUs, the NVIDIA Tesla V100 proved to be the most efficient device at 125.9 and 47.2 images per second per watt for AlexNet and GoogLeNet, respectively. Although FPGA frameworks and devices currently lag behind, they have the potential to compete with GPUs in both performance and efficiency.
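    The efficiency figures quoted above combine measured throughput with device power. Below is a minimal sketch of how those two figures of merit relate; the `run_inference` call and the board-power value are hypothetical placeholders, not details taken from the thesis.

```python
import time

def run_inference(batch):
    # Placeholder for a framework-specific inference call (e.g., an
    # OpenVINO or Zebra request); here it only simulates fixed work.
    time.sleep(0.001 * len(batch))

def benchmark(num_images=1000, batch_size=10, board_power_watts=75.0):
    """Return (images/s, images/s/W) for a simulated inference run."""
    batches = [list(range(batch_size)) for _ in range(num_images // batch_size)]
    start = time.perf_counter()
    for batch in batches:
        run_inference(batch)
    elapsed = time.perf_counter() - start
    throughput = num_images / elapsed            # images per second
    efficiency = throughput / board_power_watts  # images per second per watt
    return throughput, efficiency

if __name__ == "__main__":
    ips, ipspw = benchmark()
    print(f"{ips:.1f} images/s, {ipspw:.2f} images/s/W")
```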

    Exploring ML-Oriented Hardware for Accelerated and Scalable Feature Extraction

    Machine-learning (ML) algorithms, tools, and devices continue to grow with the intent of automating and accelerating many aspects of daily life. Hardware accelerators can enable these ML applications to achieve maximized performance. The first phase of this dissertation explores the maximum throughput of field-programmable gate arrays (FPGAs), CPUs, and GPUs on two architecturally different convolutional neural networks (CNNs) built from similar fundamental neural-network operations: GoogLeNet and AlexNet. Because of their highly parallel nature, GPUs achieved the highest inference throughput across models and devices, and additional tensor acceleration significantly boosts performance. To better understand the design and impact of ML-oriented hardware and software, the second phase of this dissertation analyzes subsequent generations of high-performance and embedded devices that feature ML optimizations, in terms of latency and throughput. Tensor, vision, and other ML-focused architectures are also considered. Because many of these devices feature hardware for quantized and reduced-precision datatypes, GoogLeNet and AlexNet are quantized with more modern ML frameworks for optimized performance with state-of-the-art backend acceleration software. Although GPUs dominate in throughput and FPGAs achieve the lowest latencies, all of the devices use significant compute, memory, and power resources to achieve their respective performance. The final phase of this dissertation explores neuromorphic technology as an alternative approach to ML object classification that reduces the overall compute, memory, and power required. Neuromorphic sensors capture events at microsecond resolution rather than generating entire frames, limiting the amount of redundant data captured. These events can be related spatially, through algorithms such as k-means clustering, or spatio-temporally, through neuromorphic algorithms such as "A Hierarchy Of event-based Time Surfaces" (HOTS). FPGA accelerators for k-means clustering and HOTS are designed and optimized using state-of-the-art high-level synthesis tools and evaluated on multiple datasets. The highly scalable k-means clustering accelerator achieved an event-processing latency of 65 nanoseconds and a throughput of 15.38 MEvt/s while using less than 2% of available FPGA resources and remaining competitive in accuracy. This dissertation benchmarked many state-of-the-art hardware accelerators, analyzed the impacts of ML hardware and software optimizations, and developed an ultra-low-latency, scalable alternative to ML object classification with neuromorphic technology.
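    As a rough software reference for the spatial clustering step mentioned above, the sketch below runs a plain k-means over event (x, y) coordinates on synthetic data. It is not the dissertation's FPGA accelerator; the event data and cluster count are made-up placeholders. Note also that the reported figures are mutually consistent: processing one event every 65 ns corresponds to roughly 1 / 65 ns ≈ 15.4 MEvt/s.

```python
import numpy as np

def kmeans_events(xy, k=4, iters=20, seed=0):
    """Cluster event coordinates xy (shape N x 2) into k spatial clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k randomly chosen events.
    centers = xy[rng.choice(len(xy), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each event to its nearest center (Euclidean distance).
        dists = np.linalg.norm(xy[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the events assigned to it.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = xy[labels == c].mean(axis=0)
    return centers, labels

if __name__ == "__main__":
    # Synthetic (x, y) pixel addresses standing in for sensor events.
    rng = np.random.default_rng(1)
    events = np.vstack([rng.normal(loc, 3.0, size=(250, 2))
                        for loc in [(20, 20), (90, 30), (50, 100), (110, 110)]])
    centers, labels = kmeans_events(events, k=4)
    print("cluster centers:\n", centers.round(1))
```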